
Bugfix in distributed GPU tests and Distributed set! #3880

Merged: 55 commits merged into main on Nov 9, 2024

Conversation

@simone-silvestri (Collaborator) commented Oct 29, 2024

This PR modifies the configuration of the distributed pipeline that runs on the Caltech cluster to allow using CUDA-aware MPI.

closes #3897
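
(For reference, a minimal sketch of how a cluster's CUDA-aware system MPI is typically hooked up to MPI.jl via MPIPreferences; this is an illustration under that assumption, not necessarily the exact configuration used in this PR:)

using MPIPreferences

# Write a LocalPreferences.toml that points MPI.jl at the system's (CUDA-aware)
# libmpi instead of the default JLL-provided binary, which is not CUDA-aware.
MPIPreferences.use_system_binary()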

@glwagner marked this pull request as ready for review on October 29, 2024 16:01
@ali-ramadhan (Member) left a comment

Looks good to me!

  commands:
    - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
  agents:
-   slurm_mem: 120G
+   slurm_mem: 8G

Member:
why?

@simone-silvestri (Collaborator, Author) Nov 7, 2024:
120G is much more than these tests need. After some frustration with tests that were extremely slow to start, I noticed that agents started much more quickly when I requested less memory. I deduce that the tests run on shared nodes rather than exclusive ones, so requesting fewer resources lets us squeeze in when the cluster is busy.

Member:
Good reason. It might warrant a comment.

Project.toml (Outdated)

  [targets]
- test = ["DataDeps", "Enzyme", "SafeTestsets", "Test", "TimesDates"]
+ test = ["DataDeps", "SafeTestsets", "Test", "Enzyme", "MPIPreferences", "TimesDates"]

Member:
Was this the crucial part?

@glwagner (Member) left a comment

This looks good. For future generations, can you please write a little bit about what you tried and what ended up working? I can't tell whether all the changes are necessary, though the end result is fairly clean. Mostly I am wondering about slurm_mem. I'm also curious why we cannot call precompile_runtime inside runtests.jl and why it is necessary to call it before Pkg.test(). This has implications for the CI of other packages.

@simone-silvestri (Collaborator, Author) commented Nov 7, 2024

I think it is equivalent. I am trying to precompile inside the runtests.
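
(For context, a minimal sketch of what precompiling the CUDA runtime at the top of runtests.jl could look like; the placement and the use of CUDA.precompile_runtime here are assumptions, not a description of what this PR finally does:)

# At the top of runtests.jl (hypothetical placement)
using CUDA

# Compile the GPU runtime library up front, so the first kernel launch in the
# distributed tests does not trigger a long compilation on every rank.
CUDA.precompile_runtime()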

By the way, regaining access to the distributed GPU tests highlighted a bug in the set! function that is specific to distributed architectures, which I am fixing in this PR.

@simone-silvestri changed the title from "New strategy for defining architecture in distributed tests" to "Bugfix in distributed GPU tests and Distributed set!" on Nov 7, 2024

@simone-silvestri (Collaborator, Author) commented Nov 7, 2024

There are two distinct issues with the GPU tests:

  1. MPI was not CUDA-aware
  2. CUDA_runtime was not found in the tests

Issue (1) is solved but, unfortunately, some tests still fail stochastically because CUDA_runtime is not found.
The failures seem linked to how long after the init step a particular test starts, i.e. tests that start much later fail with a "CUDA_runtime not found" error. I am still investigating.

@simone-silvestri (Collaborator, Author) commented:

Ok, with some fiddling, CUDA now seems to be found correctly. I think this entry

  CUDA_Runtime_jll = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"

is the change that made it work.

However, I have added a failsafe option following the suggestion in
https://github.com/JuliaGPU/CUDA.jl/blob/a085bbb3d7856dfa929e6cdae04a146a259a2044/src/initialization.jl#L96
(the error we were encountering), which reloads and recompiles CUDA_Runtime_jll and restarts the Julia session.
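
(A rough sketch of what such a failsafe could look like; the trigger condition, the call to Pkg.precompile, and the exit-and-restart strategy below are assumptions, not the exact mechanism in this PR:)

using CUDA, Pkg

# Failsafe: if the CUDA runtime cannot be found, force (re)precompilation of
# CUDA_Runtime_jll and exit, so the test harness can relaunch a fresh Julia session.
if !CUDA.functional()
    Pkg.precompile("CUDA_Runtime_jll")
    exit(1)
end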

@@ -51,7 +51,7 @@ steps:
  commands:
    - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
  agents:
-   slurm_mem: 8G
+   slurm_mem: 50G

Member:
👀

Member:
We can probably reduce the memory usage of the tests, right? I think a bigger grid than needed is often used.

Collaborator (Author):
Right, the unit tests do not require much memory, but I have seen that 32G was not enough for the GPU regression tests.

Member:
They might be too big.

@navidcy added the "bug 🐞 Even a perfect program still has bugs" label on Nov 9, 2024
@navidcy merged commit bfee493 into main on Nov 9, 2024
46 checks passed

Labels: bug 🐞 Even a perfect program still has bugs
Linked issue this PR may close: The MPI we use in the distributed tests is not CUDA-aware
4 participants